finite-time guarantee


We thank all the reviewers for their encouraging comments

Neural Information Processing Systems

We thank all the reviewers for their encouraging comments. In both these cases, τ is effectively zero. Liu et al. show how GTD-class algorithms can be formally derived from a primal-dual saddle-point formulation. Sutton et al. present a (single time-scale) variant of linear TD learning, which they call emphatic TD. They also provide an asymptotic convergence analysis to the set of local optima. If the paper is accepted, we will work further on improving the clarity of the work.
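
For context, the primal-dual view mentioned above is the standard saddle-point reformulation of the mean-squared projected Bellman error (MSPBE); the notation below (with $A$, $b$, $C$ as the usual feature expectations) is ours, added only as a reminder, and is not part of the rebuttal:
\[
\tfrac{1}{2}\,\mathrm{MSPBE}(\theta)
  \;=\; \tfrac{1}{2}\,\lVert b - A\theta \rVert_{C^{-1}}^{2}
  \;=\; \max_{w}\;\Bigl(\langle w,\; b - A\theta\rangle \;-\; \tfrac{1}{2}\, w^{\top} C\, w\Bigr),
\qquad
A = \mathbb{E}\bigl[\phi(\phi - \gamma\phi')^{\top}\bigr],\;
b = \mathbb{E}[r\,\phi],\;
C = \mathbb{E}\bigl[\phi\phi^{\top}\bigr],
\]
so GTD-style updates can be read as stochastic gradient descent in $\theta$ and ascent in $w$.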


Preference-based Reinforcement Learning with Finite-Time Guarantees

Neural Information Processing Systems

Preference-based Reinforcement Learning (PbRL) replaces reward values in traditional reinforcement learning with preferences, to better elicit human opinion on the target objective, especially when numerical reward values are hard to design or interpret. Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy. In this paper, we present the first finite-time analysis for general PbRL problems. We first show that for PbRL, a unique optimal policy may not exist if preferences over trajectories are deterministic. If preferences are stochastic, and the preference probability relates to the hidden reward values, we present algorithms for PbRL, both with and without a simulator, that are able to identify the best policy up to accuracy $\varepsilon$ with high probability. Our method explores the state space by navigating to under-explored states, and solves PbRL using a combination of dueling bandits and policy search. Experiments show the efficacy of our method when it is applied to real-world problems.
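
To make the stochastic-preference assumption concrete: a common modeling choice (used here purely for illustration; the logistic link and the function names are our assumptions, not necessarily the paper's) is to let the probability of preferring one trajectory over another grow with the gap in their hidden cumulative rewards, as in a Bradley-Terry model. A minimal sketch of such a preference oracle:

    import numpy as np

    def trajectory_return(rewards):
        """Hidden cumulative reward of a trajectory (unknown to the learner)."""
        return float(np.sum(rewards))

    def preference_oracle(rewards_a, rewards_b, scale=1.0, rng=None):
        """Sample a stochastic preference between two trajectories.

        Assumes a Bradley-Terry / logistic link: the probability that
        trajectory A wins grows with its hidden return advantage.
        Returns 1 if A is preferred, 0 otherwise.
        """
        rng = np.random.default_rng() if rng is None else rng
        gap = scale * (trajectory_return(rewards_a) - trajectory_return(rewards_b))
        p_a = 1.0 / (1.0 + np.exp(-gap))
        return int(rng.random() < p_a)

    # Example: 100 noisy duels between two short trajectories.
    wins_a = sum(preference_oracle([1.0, 0.5, 0.0], [0.2, 0.2, 0.2]) for _ in range(100))
    print(f"trajectory A preferred in {wins_a}/100 duels")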


Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees

Neural Information Processing Systems

Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy that best fits observed sequences of states and actions implemented by an expert. Many algorithms for IRL have an inherently nested structure: the inner loop finds the optimal policy given parametrized rewards, while the outer loop updates the estimates towards optimizing a measure of fit. For high-dimensional environments, such a nested-loop structure entails a significant computational burden. To reduce the computational burden of the nested loop, novel methods such as SQIL \cite{reddy2019sqil} and IQ-Learn \cite{garg2021iq} emphasize policy estimation at the expense of reward estimation accuracy. However, without accurately estimated rewards, it is not possible to do counterfactual analysis such as predicting the optimal policy under different environment dynamics and/or learning new tasks.
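
The nested-loop pattern criticized here can be made concrete with a small tabular sketch. This is an illustrative maximum-likelihood IRL skeleton under a soft/MaxEnt policy model with a finite-difference outer gradient; these modeling choices are ours and are not the methods discussed in the abstract:

    import numpy as np

    def soft_value_iteration(reward, P, gamma=0.9, n_iter=300):
        """Inner loop: solve for the soft-optimal (softmax) policy of a tabular MDP.

        reward: (S, A) reward table; P: (S, A, S) transition probabilities.
        """
        V = np.zeros(P.shape[0])
        for _ in range(n_iter):
            Q = reward + gamma * P @ V
            V = np.log(np.exp(Q).sum(axis=1))      # soft maximum over actions
        Q = reward + gamma * P @ V
        return np.exp(Q - V[:, None])              # (S, A) stochastic policy

    def expert_log_likelihood(theta, expert_sa, features, P):
        """Log-likelihood of expert (state, action) pairs; needs a full policy solve."""
        policy = soft_value_iteration(features @ theta, P)
        return sum(np.log(policy[s, a]) for s, a in expert_sa)

    def nested_loop_irl(expert_sa, features, P, n_outer=50, lr=0.1, eps=1e-4):
        """Outer loop: ascend the likelihood; every finite-difference gradient
        coordinate triggers another inner-loop policy solve, which is the
        computational burden the abstract refers to."""
        theta = np.zeros(features.shape[-1])
        for _ in range(n_outer):
            base = expert_log_likelihood(theta, expert_sa, features, P)
            grad = np.zeros_like(theta)
            for j in range(theta.size):
                e = np.zeros_like(theta)
                e[j] = eps
                grad[j] = (expert_log_likelihood(theta + e, expert_sa, features, P) - base) / eps
            theta += lr * grad
        return theta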



Review for NeurIPS paper: Preference-based Reinforcement Learning with Finite-Time Guarantees

Neural Information Processing Systems

This paper generated considerable discussion among the reviewers. On the positive side, the paper makes a solid contribution to the emerging literature on preference-based RL, a topic of some importance, and offers some interesting insights (e.g., on the potential lack of a "winning policy") and novel algorithmic contributions. Conversely, some reviewers raised issues with some of the assumptions made in the paper and with the presentation (which seems to assume familiarity with PbRL and its motivations/rationale). The author response was thoughtful and generated some discussion (some of which is not reflected in the reviews, a couple of which unfortunately failed to get updated). On my own reading of the paper, I agree that it makes a useful contribution to PbRL, especially from a technical and conceptual perspective (although I don't believe it makes PbRL more practical at this stage).


Review for NeurIPS paper: Preference-based Reinforcement Learning with Finite-Time Guarantees

Neural Information Processing Systems

Weaknesses: There are two main weaknesses. First, I'm not sure whether the algorithm or the analysis is meant to be the core contribution. If it's the algorithm, then the paper needs to actually test the algorithm in more than toy settings (and ideally with real humans, rather than simulating answers with a Bradley-Terry-Luce (BTL) model under two parameter settings). But if it's the analysis, I almost feel the experiments are distracting, or at least overstated and drawing attention away from the main contributions. I'd love to hear the authors' perspective on this, but my suggestion would be to either a) get the best of both worlds by running a more serious experiment, or b) edit the paper to highlight the analysis and justify the experiments as showing what the algorithm does empirically, perhaps with some qualitative analysis of the resulting behavior on simple tasks to aid understanding of the algorithm.



Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

Neural Information Processing Systems

Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory.
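
For readers who want to see the basic mechanism, the loop below is a generic simulated-annealing sketch on a bounded continuous (box) domain; the Gaussian random-walk proposal and logarithmic cooling schedule are illustrative choices on our part, not the specific formulation analyzed in the paper:

    import numpy as np

    def simulated_annealing(f, lower, upper, n_iter=5000, step=0.1, t0=1.0, seed=0):
        """Minimize f over the box [lower, upper] with a Metropolis random walk
        whose temperature is gradually lowered."""
        rng = np.random.default_rng(seed)
        lower, upper = np.asarray(lower, float), np.asarray(upper, float)
        x = rng.uniform(lower, upper)
        fx = f(x)
        best_x, best_f = x.copy(), fx
        for k in range(1, n_iter + 1):
            temp = t0 / np.log(k + 1)                                  # slow cooling
            prop = np.clip(x + step * rng.standard_normal(x.shape), lower, upper)
            fp = f(prop)
            # Always accept improvements; accept uphill moves with Boltzmann probability.
            if fp <= fx or rng.random() < np.exp(-(fp - fx) / temp):
                x, fx = prop, fp
                if fx < best_f:
                    best_x, best_f = x.copy(), fx
        return best_x, best_f

    # Example: a multimodal (Rastrigin-like) test function on [-5, 5]^2.
    x_star, f_star = simulated_annealing(
        lambda z: np.sum(z ** 2) + 10 * np.sum(1 - np.cos(2 * np.pi * z)),
        lower=[-5, -5], upper=[5, 5])
    print(x_star, f_star)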


Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees

Zeng, Siliang, Li, Chenliang, Garcia, Alfredo, Hong, Mingyi

arXiv.org Artificial Intelligence

Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy that best fits observed sequences of states and actions implemented by an expert. Many algorithms for IRL have an inherently nested structure: the inner loop finds the optimal policy given parametrized rewards, while the outer loop updates the estimates towards optimizing a measure of fit. For high-dimensional environments, such a nested-loop structure entails a significant computational burden. To reduce the computational burden of the nested loop, novel methods such as SQIL [1] and IQ-Learn [2] emphasize policy estimation at the expense of reward estimation accuracy. However, without accurately estimated rewards, it is not possible to do counterfactual analysis such as predicting the optimal policy under different environment dynamics and/or learning new tasks. In this paper we develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm provably converges to a stationary solution with a finite-time guarantee. If the reward is parameterized linearly, we show that the identified solution corresponds to the solution of the maximum entropy IRL problem. Finally, by using robotics control problems in MuJoCo and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks.
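
To contrast with the nested-loop pattern, the single-loop idea described here, following each incremental policy-improvement step with a single stochastic gradient step on the reward, can be sketched roughly as below. This is a hypothetical skeleton: the soft Bellman backup, the softmax policy, and the approximate likelihood-gradient step are our illustrative choices, not the authors' actual updates:

    import numpy as np

    def single_loop_irl(expert_sa, features, P, gamma=0.9, n_iter=2000,
                        lr_reward=0.05, seed=0):
        """Alternate ONE soft policy-improvement step with ONE stochastic reward
        update, instead of fully re-solving the policy at every reward change.

        expert_sa: list of (state, action) pairs; features: (S, A, d) array;
        P: (S, A, S) transition probabilities.
        """
        rng = np.random.default_rng(seed)
        S, A, d = features.shape
        theta = np.zeros(d)
        V = np.zeros(S)
        for _ in range(n_iter):
            reward = features @ theta                    # current reward table (S, A)
            # Single soft Bellman backup = incremental policy improvement.
            Q = reward + gamma * P @ V
            V = np.log(np.exp(Q).sum(axis=1))
            policy = np.exp(Q - V[:, None])              # softmax policy
            # Approximate likelihood step on one sampled expert transition:
            # raise the expert action's feature weight, lower the policy's expectation.
            s, a = expert_sa[rng.integers(len(expert_sa))]
            grad = features[s, a] - policy[s] @ features[s]
            theta += lr_reward * grad
        return theta, policy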